Part I - A research on Loan data records from prosper in the United States.

by Ayotunde Jeffers Doherty

Introduction

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, interest rate, current loan status, borrower income, and many others. The analysis in this part would be structured to provide simple univariate relationships to multivariate relationships, this research would provide answers to questions like whether the monthly loan payment has a correlation or any relationship between loan original,amount, what is the spread of lterm of loan in loan status, identifying the frequency of the categorical variables; Term of loan, borrower's employment status, year of loan, and loan status, are there differences between loans depending on how the loan term large the original loan amount was. We have 81 features attributed to each record of loans in the dataset and the descriptions of the features are detailed below:

  • ListingKey Unique key for each listing, same value as the 'key' used in the listing object in the API.
  • ListingNumber The number that uniquely identifies the listing to the public as displayed on the website. -ListingCreationDate The date the listing was created. -CreditGrade The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings. -Term The length of the loan expressed in months. -LoanStatus The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, ---- -PastDue. The PastDue status will be accompanied by a delinquency bucket. -ClosedDate Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses. -BorrowerAPR The Borrower's Annual Percentage Rate (APR) for the loan. -BorrowerRate The Borrower's interest rate for this loan. -LenderYield The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee. -EstimatedEffectiveYield Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009. -EstimatedLoss Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009. -EstimatedReturn The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009. -ProsperRating (numeric) The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009. -ProsperRating (Alpha) The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009. -ProsperScore A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. -ListingCategory The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - --Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans -BorrowerState The two letter abbreviation of the state of the address of the borrower at the time the Listing was created. -Occupation The Occupation selected by the Borrower at the time they created the listing. -EmploymentStatus The employment status of the borrower at the time they posted the listing. -EmploymentStatusDuration The length in months of the employment status at the time the listing was created. -IsBorrowerHomeowner A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. -CurrentlyInGroup Specifies whether or not the Borrower was in a group at the time the listing was created. -GroupKey The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation. -DateCreditPulled The date the credit profile was pulled. -CreditScoreRangeLower The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency. -CreditScoreRangeUpper The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency. -FirstRecordedCreditLine The date the first credit line was opened. -CurrentCreditLines Number of current credit lines at the time the credit profile was pulled. -OpenCreditLines Number of open credit lines at the time the credit profile was pulled. -TotalCreditLinespast7years Number of credit lines in the past seven years at the time the credit profile was pulled. -OpenRevolvingAccounts Number of open revolving accounts at the time the credit profile was pulled. -OpenRevolvingMonthlyPayment Monthly payment on revolving accounts at the time the credit profile was pulled. -InquiriesLast6Months Number of inquiries in the past six months at the time the credit profile was pulled. -TotalInquiries Total number of inquiries at the time the credit profile was pulled. -CurrentDelinquencies Number of accounts delinquent at the time the credit profile was pulled. -AmountDelinquent Dollars delinquent at the time the credit profile was pulled. -DelinquenciesLast7Years Number of delinquencies in the past 7 years at the time the credit profile was pulled. -PublicRecordsLast10Years Number of public records in the past 10 years at the time the credit profile was pulled. -PublicRecordsLast12Months Number of public records in the past 12 months at the time the credit profile was pulled. -RevolvingCreditBalance Dollars of revolving credit at the time the credit profile was pulled. -BankcardUtilization The percentage of available revolving credit that is utilized at the time the credit profile was pulled. -AvailableBankcardCredit The total available credit via bank card at the time the credit profile was pulled. -TotalTrades Number of trade lines ever opened at the time the credit profile was pulled. -TradesNeverDelinquent Number of trades that have never been delinquent at the time the credit profile was pulled. -TradesOpenedLast6Months Number of trades opened in the last 6 months at the time the credit profile was pulled. -DebtToIncomeRatio The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%). -IncomeRange The income range of the borrower at the time the listing was created. -IncomeVerifiable The borrower indicated they have the required documentation to support their income. -StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created. -LoanKey Unique key for each loan. This is the same key that is used in the API. -TotalProsperLoans Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans. -TotalProsperPaymentsBilled Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans. -OnTimeProsperPayments Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans. -ProsperPaymentsLessThanOneMonthLate Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. -ProsperPaymentsOneMonthPlusLate Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. -ProsperPrincipalBorrowed Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. -ProsperPrincipalOutstanding Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. -ScorexChangeAtTimeOfListing Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans. -LoanCurrentDaysDelinquent The number of days delinquent. -LoanFirstDefaultedCycleNumber The cycle the loan was charged off. If the loan has not charged off the value will be null. -LoanMonthsSinceOrigination Number of months since the loan originated. -LoanNumber Unique numeric value associated with the loan. -LoanOriginalAmount The origination amount of the loan. -LoanOriginationDate The date the loan was originated. -LoanOriginationQuarter The quarter in which the loan was originated. -MemberKey The unique key that is associated with the borrower. This is the same identifier that is used in the API member object. -MonthlyLoanPayment The scheduled monthly loan payment. -LP_CustomerPayments Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. -LP_CustomerPrincipalPayments Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. -LP_InterestandFees Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries. -LP_ServiceFees Cumulative service fees paid by the investors who have invested in the loan. -LP_CollectionFees Cumulative collection fees paid by the investors who have invested in the loan. -LP_GrossPrincipalLoss The gross charged off amount of the loan. -LP_NetPrincipalLoss The principal that remains uncollected after any recoveries. -LP_NonPrincipalRecoverypayments The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal. -PercentFunded Percent the listing was funded. -Recommendations Number of recommendations the borrower had at the time the listing was created. -InvestmentFromFriendsCount Number of friends that made an investment in the loan. -InvestmentFromFriendsAmount Dollar amount of investments that were made by friends. -Investors The number of investors that funded the loan.

Preliminary Wrangling

The term of loan expressed in months reveals that the column data type should be categorical, this would be converted from integer to categorical, also the numerical value would be renamed to a categorical term.

Loan status values reveals that the column data type should be categorical, this would be effected in the following line of code.

The past due values in days would have a generic value Past Due, this would be replaced with the following line of code

Convert the loan origination date column from object to datetime and extracting the year out of the column for analysis

Employment status values reveals that the column data type should be categorical with a 2255 sum of missing values representing 1.98% of missing values in the colum, replacing the missing value with the most occured employment status and changing the data type will suffice.

The stated monthly income is captured in 6 decimal places float data type, converting the datatype from flaot to integer would suffice.

Occupation values reveals that the column data type should be categorical with a sum of 3588 missing values representing 3.15% of missing values in the column, the null values in occupation column would be replaced with the key word unidentified afterwards the data type would be changed to categorical.

The borrower state values reveals that the column data type should be categorical with a sum of 5515 missing values representing 4.84% of missing values in the column, replacing the missing value with a unique key word UNKNOWN and changing the data type will suffice.

The structure of the data set is 113,937 rows and 81 columns, implying 113,937 recorded observations with 81 features. The main features of interest to this study include but not limited to the following; loan status, loan term, employment Status, is borrower a homeowner or not, borrower state, income verifiable or not and occupation. To get a better understanding of how this features of interest would be investigated a number of features would support this study which include the following features original loan amount, monthly loan payment, loan current days of delinquency, stated monthly income, investors and recommendations.

In spite of the fact that the dataframe has 81 features, this study is only interested in few of the features, it would be appropriate to shrink the dataframe to the useful columns for the purpose of this study. The following line of code would be executed to extract the necessary column needed for analysis.

Univariate Exploration

The data in the above graph are right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, the isolated bar indicates an outliers in the ranges between 32000 dollars and 35000 dollars. The data spread is from about 1000 dollars to 3500 dollars.

The data in the above graph are right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars. The data spread is from about zero dollars to 2251 dollars.

For example, the probability that a randomly chosen loan original amount will fall between 5000 dollars and 12000 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [5000, 12000].

For example, the probability that a randomly chosen monthly lona payment will fall between 300 dollars and 500 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [300, 500].

Loans disbursed on the medium term in this case 36months has the highest occurence with a count of 87778 representing about 77 percent of loan term duration, leaving the other 23 percent distributed between the long term (60 months) and short term (12months)loan duration.

Those who are employed has the highest occurence in the employment status category with a count of 69557, those who are retired got the lowest occurence in the employment status category, it's more likely to disburse a laon to working class compare to a retired individual.

The year 2013 had the highest number of loan disbursment with a occurence of 34345, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement.

The loan status data distribution further indicates that few borrowewrs cancelled their loans and that majority of the borrowers has a current an serviceable loan, those who defaulted in servicing their loan are about 10 percent of those who have current and servicable laon. Those borrowers who are past due has one or more due day.

Those borrowers who own a home are slightly above those who do not own a home, the difference here is not quite significant.

Those borrowers whose income is verifiable has the highest occurence, this is quite understandable in this aspect as it is more likely to obtain a loan if the source of income is well define and verifiable, so to speak the probabability of securing a loan if high with a verfiable income.

The image above displays the state where the borrower originates from, with california with a very large fonts indicates how frequent the state is relative to other states. Wyoming and Maine ranked among the least state the borrower originates from, the null values in the state feature was replaced with unknown as we can visualize from the image above.

The image above displays the occupation of the borrowers, with professional and computer programmer with a very large fonts indicates how frequent the occupations are relative to other occupations. community college and labor retail ranked among the least occurence of occupation of the borrower.

The figure above depicts the distribution of the stated monthly income with outliers indicated with an astericks, due to the nature of the data the first quartile and the rest cannot be visualized from this figure.

The figure above depicts the distribution of the loan original amount with skewness to the right and with outliers indicated with an astericks out of the whiskers. The 25th percentile of the loan original amount variable falls slightly before 5000 dollars, while the median falls above the 5000 dollars, the 75th percentile falls short of 15000 dollars.

The figure above depicts the distribution of the monthly loan payment with skewness to the right and with outliers indicated with an astericks out of the whiskers. The 25th percentile, median and 75th percentile of the monthly loan payment variable falls short of 500 dollars.

The figure above depicts the distribution of the investors with skewness to the right and outliers indicated with an astericks out of the whiskers. The 25th percentile, median and 75th percentile of the investors variable falls short of 200 investors.

Summary

To analyse the loan with respect the year, the loan origination date column was converted from object datatype to datetime, afterwards the year was extracted from the datetime before setting the data type of the extracted year column as categorical variable, also the loan term values was trasform from the original values 12months, 36months, and 60months to short term, medium term and long term respectively to make for a better behavior as a categorical variable. The loan status has values respresenting past due in a number of categories of days, these values were replaced with a single value named 'past due' regardless of the number of days. The borrower state values were transformed from state abbrevation to full text without leaving out the stated monthy income and monthy loan payment variable out from transformation, these variables were converted from float to integer for consistency with the loan amount data type. The occupation column was transformed from object data type to categorical data type.

For the distribution of term, loans disbursed on the medium term in this case 36months has the highest occurence with a count of 87778 representing about 77 percent of loan term duration, leaving the other 23 percent distributed between the long term (60 months) and short term (12months)loan duration. the loan has distribution over 10 years between 2005 t0 2014, the year 2013 had the highest number of loan disbursement with a occurence of 34345, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement. The distibution of the loan original amount is right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, there are outliers in the ranges of 32000 dollars and 35000 dollars. The distribution of monthly loan payment is right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars.

Bivariate Exploration

Apparently the medium term has the highest occurence of loan duration in the distribution from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. other categories can be visualized from the figure above.

The home ownership variable has the true and false values evenly distributed from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 30478 were true(home owner), while 26098 were false(no home ownership). The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home.

The verifiable income variable has the true value as the highest occurence in the distribution from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 52434 has true (verifiable income) , while 4142 were false(no verifiable income). The past due category has a spread of 1857 verifiable income borrowers, and 210 non income verifiable borrowers. The completed loan status category has verifiable income borrowers of 35292 while 2792 does not have a verifiable income.

The loan original amount has the highest mean value of the year 2014 with a lowest mean value in the year 2005, for the stated monthly income the year 2005 has the highest mean value with year 2008 with the lowest mean value.

The loan original amount has the highest mean value in the current loan status category and a lowest mean value in the cancelled loan category, for the stated monthly income the loan status final payment in progress has the highest mean value with loan status cancelled with the lowest mean values.

The loan original amount has the highest mean value in true value of verifiable income variable, and a lowest mean value in the false value of verifiable income variable, for the stated monthly income variable the highest mean value is true value of verifiable income variable with a lowest mean value in the false value of verifiable income variable.

The above figure illustrates that majority of the values fall between 0 doallrs and 25000 dollars, as this indicates no positive correlation between the two variables.

The above figure illustrates a positive correlation between the two variables, as the original loan amount increases the monthly loan payment increase relatively.

Summary

The verifiable income variable has the true value as the highest occurence in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 52434 has true (verifiable income), while 4142 were false(no verifiable income). The past due category has a spread of 1857 verifiable income borrowers, and 210 non income verifiable borrowers. The completed loan status category has verifiable income borrowers of 35292 while 2792 does not have a verifiable income. Apparently the medium term has the highest occurence of loan duration in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. The home ownership variable has the true and false values evenly distributed from earlier findings, the spread over the loan status category on the current loan status can be figure out as 30478 were true(home owner), while 26098 were false(no home ownership). The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home. There exist a positive correlation between the two variables, as the original loan amount increases the monthly loan payment increase relatively.

Multivariate Exploration

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by term of loan.

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by year of loan.

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by verifiable income.

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by home ownership.

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by employment status.

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by loan status.

From the figure above we can deduce the correlation by the heatmap, we could vizualize we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.

Summary

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across a scatterplot categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status plotted on different scatter plot to better depict the spread. We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.

Conclusions

Before carrying out this study a number of transformation was carried out from the gathered dataset, to begin with the data cleaning process the loan origination date column was converted from object datatype to datetime, afterwards the year was extracted from the datetime before setting the data type of the extracted year column as categorical variable, also the loan term values was trasform from the original values 12months, 36months, and 60months to short term, medium term and long term respectively to make for a better behavior as a categorical variable. The loan status has values respresenting past due in a number of categories of days, these values were replaced with a single value named 'past due' regardless of the number of days. The borrower state values were transformed from state abbrevation to full text without leaving out the stated monthy income and monthy loan payment variable out from transformation, these variables were converted from float to integer for consistency with the loan amount data type. The occupation column was transformed from object data type to categorical data type. Owing to the fact that the features are 81 in number the variables of interest for te purpose of this study were pulled together into a new new dataframe, saved as a comma-seperated values to be referenced for exploration and visualization.

For the distribution of term, loans disbursed on the medium term in this case 36months has the highest occurence with a count of 87778 representing about 77 percent of loan term duration, leaving the other 23 percent distributed between the long term (60 months) and short term (12months)loan duration. the loan has distribution over 10 years between 2005 t0 2014, the year 2013 had the highest number of loan disbursement with a occurence of 34345, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement. The distibution of the loan original amount is right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, the isolated bar indicates an outliers in the ranges between 32000 dollars and 35000 dollars. The distribution of monthly loan payment is right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars. Finally, We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.

The income verifiable variable has the true value as the highest occurence in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 52434 has true (verifiable income), while 4142 were false(no verifiable income). The past due category has a spread of 1857 verifiable income borrowers, and 210 non income verifiable borrowers. The completed loan status category has verifiable income borrowers of 35292 while 2792 does not have a verifiable income. Apparently the medium term has the highest occurence of loan duration in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. The home ownership variable has the true and false values evenly distributed from earlier findings, the spread over the loan status category on the current loan status can be figure out as 30478 were true(home owner), while 26098 were false(no home ownership). The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home. There exist a positive correlation between the two variables, as the original loan amount increases the monthly loan payment increase relatively. From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across a scatterplot categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status plotted on different scatter plot to better depict the spread. We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.